GROUP WORK - DEADLINE 15-Dec-23.
Please submit your final report using this form.

Radon, a natural radioactive gas, is a known carcinogen that can cause lung cancer, especially in high concentrations. In the United States, it’s estimated to lead to thousands of lung cancer deaths annually. Radon levels in homes across the country vary significantly, with some houses having dangerously high concentrations.

To pinpoint areas with high radon exposure, the Environmental Protection Agency conducted radon measurements in over 80,000 randomly selected houses nationwide. Our objective in analyzing this data was to estimate the range of radon levels in each of the approximately 3000 U.S. counties. This information would help homeowners decide whether to measure or reduce radon in their houses based on local conditions.

For the analysis, we organized the data hierarchically: houses within counties. If we were to consider multiple measurements within houses, it would create a three-level hierarchy of measurements, houses, and counties.

Creating a reproducible lab report

This time, you will need to use your own RStudio environment. You can also use one of the spaces you used for a previous lab, if you like!

  • Load the tidyverse, table1, GGally, broom, broom.mixed, lmerTest, faux, marginaleffects and the modelsummary packages, and any other packages you might deem useful.
  • Load the radon.rds data set into your workspace, and assign it to an object called radon.
radon <- read_rds("https://bit.ly/BMLR-Radon")
  • Knit your file to see that everything is working.
  • For each question, please add the text of the question, the code chunks and your answer.

Simulating the data

Simulating data is an important step in any analysis, as it serves a range of purposes. First, it helps clarify our assumptions and expectations: the treatment effects we might realistically observe, the types of interactions we should consider, and the size and features of the data we need to establish the claims we want to make. We can work in a “clean” environment where variables are truly normal, there are no missing values, and the data is “well behaved”. Once we have grappled with this ideal world, we can slowly add complexities and see how they change the picture.

Second, simulating “fake” data is a general method for understanding how statistical methods behave under repeated sampling. By automating the sampling process, we can evaluate a model’s performance: we define the parameters of our “fake” population, compare the model’s estimates and predictions with those true values, and include features that challenge our assumptions and improve our understanding. We simulate in order to test and validate our models before applying them to real data.
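As a minimal illustration of this idea (the model, sample size, and parameter values here are arbitrary choices for the sketch), we can repeatedly simulate data from a known population and check how well a simple regression recovers the true slope:

```r
# Sketch: repeated sampling from a known population
# (true intercept 1, true slope 0.5 -- arbitrary illustrative values)
set.seed(1)
slope_estimates <- replicate(500, {
  x <- rnorm(50)
  y <- 1 + 0.5 * x + rnorm(50)
  coef(lm(y ~ x))[["x"]]
})

mean(slope_estimates)  # should be close to the true slope, 0.5
sd(slope_estimates)    # the empirical standard error of the estimator
```

Comparing the distribution of the 500 estimates with the true value is exactly the kind of validation described above.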

Third, simulating fake data helps us find issues in our code. If your model struggles to recover the true parameters even with large samples or low data variance, it could signal a coding or conceptual problem. Overlaying simulated data with the assumed and fitted models can help identify these issues.

Finally, simulating synthetic data is crucial when planning new studies or collecting data. It provides a basis for what to expect, ensuring a better understanding of potential outcomes.

In the first part of this lab we are going to replicate some of the ideas presented in a post written by Michael Clark. You may consider reading the post before continuing.

We are going to use the faux package to simulate nested data (observations nested within clusters). First, we define the number of clusters n_cluster and the number of observations in each cluster, obs_per_cluster. For example, we might want to have 20 clusters with 4 observations in each cluster (to keep things simple, let’s stick to a “balanced” data-set).

obs_per_cluster <-  4
n_cluster <-  20

Now, let’s generate some data, where (level 1) observed outcomes are related to predictors in the following manner:

\[ Y = a_i + b_i\cdot X + \epsilon \\ \text{where:}~~~ \epsilon \sim \mathcal{N}(0, \sigma^2) \]

But rather than being regular population parameters, \(a_i\) and \(b_i\) are random intercepts and slopes, such that each cluster has its own, unique intercept and slope. The random intercept is a level 2 random effect:

\[ a_i = \alpha_0 + u_i \\ \text{where:}~~~u_i \sim \mathcal{N}(0, \sigma_u^2)\\ \] And the random slope is a level 2 random effect:

\[ b_i = \beta_0 + v_i \\ \text{where:}~~~v_i \sim \mathcal{N}(0, \sigma_v^2)\\ \] To keep things simple, we will make sure that the random intercepts and slopes are independent, that is, the correlation between \(u_i, v_i\) is zero \(\rho_{u,v}=0\).
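Substituting the level-2 equations into the level-1 equation gives the combined (reduced-form) model, which separates the fixed and random parts:

\[ Y = \underbrace{\alpha_0 + \beta_0 \cdot X}_{\text{fixed}} + \underbrace{u_i + v_i \cdot X + \epsilon}_{\text{random}} \]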

As you can see, at this stage we have only one predictor, \(X\), at level 1, and no predictors at level 2. We also have to define six population parameters. For example, we can choose \(\alpha_0 = 1, \beta_0 = 0.5\), three standard deviations \(\sigma = 1, \sigma_u = 0.5, \sigma_v = 0.25\), and the correlation \(\rho_{u,v} = 0\).

Feel free to change these values if you would like to simulate different data.

# Fixed intercept
alpha_0 <-  ___ # choose a number you like, e.g. 1  

# Fixed slope
beta_0 <- ___ #  e.g. 0.5

# Residuals st.dev (level 1) 
sig  <-  ___ #  e.g. 1

# Random intercept st.dev (level 2) 
sig_u <-  ___ #  e.g. 0.5

# Random slope st.dev (level 2)
sig_v <-  ___ #  e.g. 0.25

# Independent random effects (correlation of zero)
cor <- ___ #  e.g. 0

Finally, here is the code that simulates our data in several steps. Run each step separately, and inspect the contents of the data frame with View(df).

# Create a dataset that can be reproduced
set.seed(888)  
# step 1: create the units 
df <- add_random(
    cluster = n_cluster,  # first the clusters
    obs = obs_per_cluster # second, the observations
    ) 

# Print the result 
# try to figure out what just happened
df |> 
  print(n = 15)

# step 2: Add anything at level 2
# Here we have random intercept/slope 
df <- df |>
  add_ranef(
    "cluster",
    u_i = sig_u,
    v_i = sig_v,
    .cors = cor
    ) |> mutate(
      a_i = alpha_0 + u_i, 
      b_i = beta_0  + v_i
    )

# Print the result 
# try to figure out what just happened
df |> 
  print(n = 15)


# step 3: add anything at level 1
df <- df |>
  mutate(
    x       = rnorm(n_cluster * obs_per_cluster),
    epsilon = rnorm(n_cluster * obs_per_cluster, 0, sig),
    y = a_i + b_i * x + epsilon
    )


# Print the result 
# try to figure out what just happened
df |> 
  print(n = 20)

# For the models, we really only need 
# the cluster id, the y and the x

df  <- df |> select(cluster, x, y)

# Print the result 
# try to figure out what just happened
df |> 
  print(n = 20)

So now we have our final observations: the clusters, the predictor x and the outcome y. Before continuing, please go back, set obs_per_cluster and n_cluster to larger numbers, to make sure our models work.

We will then test the data against three types of models:

  • Complete pooling: This model ignores the structure of the data, taking all observations to be uncorrelated and independent of one another. The model estimates just two fixed population parameters: the global intercept \(\beta_0\) and the global slope \(\beta_1\), and the formula you will use in your linear regression model is y ~ x.
  • No pooling: This model expands on the previous one. In addition to the global intercept and slope, we add the cluster (categorical) variable and estimate a coefficient for each cluster. In effect, each cluster is associated with a cluster specific intercept. Since we know that there is a variation in the slopes, you may also test a model with an interaction between the clusters and the predictor, x. The formula you will use in your linear regression model is y ~ x + cluster for the model without the interaction and y ~ x + cluster + x:cluster for the model with the interaction.
  • Multilevel model (partial pooling): This model also expands on the Complete pooling model to account for the correlated observations within the clusters. But instead of using fixed (dummy) variables, this model uses a random intercept (and, optionally, a random slope). The formula you will use for the random intercept model is y ~ x + (1 | cluster) and the formula or the random intercept and slope model is y ~ x + (x | cluster).
  1. Adjust the parameters, simulate your clustered data, and present the results of the three types of models using the modelsummary package. Interpret your models and compare the estimates with each other, and with the six parameters used to simulate your data. Is there one single model that is superior to all the rest in terms of estimating the parameters? If so, which is it, and what makes it better? If not, which model works best for what purpose? In your answer, please pay attention to the standard errors and the Intraclass Correlation Coefficient (ICC) of your multilevel model(s). What does the ICC mean? How much does it deviate from what you would expect? For more about the ICC, please see below.
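As a sketch of how the five models can be fitted and compared (the object names are illustrative; adjust them to your own):

```r
# Fit the five models using the formulas listed above (df is the simulated data)
library(lmerTest)      # lmer() with p-values
library(modelsummary)

m_complete  <- lm(y ~ x, data = df)                        # complete pooling
m_no_pool   <- lm(y ~ x + cluster, data = df)              # no pooling
m_no_pool_x <- lm(y ~ x + cluster + x:cluster, data = df)  # no pooling + interaction
m_ml_int    <- lmer(y ~ x + (1 | cluster), data = df)      # random intercept
m_ml_slope  <- lmer(y ~ x + (x | cluster), data = df)      # random intercept + slope

modelsummary(
  list(
    "Complete pooling"           = m_complete,
    "No pooling"                 = m_no_pool,
    "No pooling + interaction"   = m_no_pool_x,
    "Multilevel"                 = m_ml_int,
    "Multilevel + random slopes" = m_ml_slope
  )
)
```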

Further explanation and exploration: Use your imagination (and curiosity) to explore your simulations and your models. For example, you may want to run multiple simulations with the same parameters and compare the variation from one simulation to the next, change the sample size (the number of clusters and the number of observations per cluster), increase the parameters, or turn some of them negative. Do the estimates vary a lot between simulations? Does increasing the sample size improve the accuracy of the estimates? Does it influence the standard errors? Is there one single model that is superior to all the rest in terms of estimating the parameters? If so, which model is it, and in what way is it better? If not, which model works best for what purpose?

Calculating the ICC: The calculation of the ICC is straightforward when we only have random intercepts. Taking the random slopes into account complicates the situation because in that case, the ICC varies with the predictors. Because here the predictor is normally distributed around zero, the calculation of the ICC (at \(x=0\)) reduces to the regular form of the ICC, which need not take the random slope into account.
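In the random-intercept case, the ICC is simply the share of the total variance that is attributable to the between-cluster variance:

\[ \text{ICC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma^2} \]

For example, with the parameter values suggested in the first simulation (\(\sigma_u = 0.5, \sigma = 1\)), this gives \(0.25 / 1.25 = 0.2\).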

For example, in the example below, data is simulated with 30 clusters, 20 observations per cluster, a global intercept of \(2\) and a global slope of \(1.5\), the variance of the (level-one) residual is \(1\), the variance of the random intercept and the random slope are \(1.5^2\) and \(1.8^2\) respectively, and the correlation coefficient is \(0.5\).

You can see that the no-pooling models have larger standard errors for the intercept, probably because the estimate is based on data from a single cluster (the reference cluster, with 20 observations and a small number of residual degrees of freedom). The no-pooling model with interactions suffers from relatively large standard errors for the same reason. Also note that for the no-pooling model, the global intercept is in fact the intercept of the reference cluster, so you are bound to get an estimate that is effectively the sum of the global intercept and the random intercept of the first cluster (\(=\alpha_0 + u_1\)).

You will not necessarily get the same estimates as in the table below! Every simulation is different, so the estimates will differ slightly, leading to different results.

|                   | Simulation parameter | Complete pooling | No pooling*      | No pooling + interaction* | Multilevel model | Multilevel model + random slopes |
|-------------------|----------------------|------------------|------------------|---------------------------|------------------|----------------------------------|
| (Intercept)       | α_0 = 2              | 1.844 (0.108)*** | -1.557 (0.463)*** | -1.868 (0.232)***        | 1.844 (0.317)*** | 1.896 (0.321)***                 |
| x                 | β_0 = 1.5            | 1.184 (0.115)*** | 1.196 (0.092)*** | -0.883 (0.231)***         | 1.195 (0.092)*** | 1.374 (0.366)***                 |
| cluster02         |                      |                  | 2.940 (0.654)*** | 3.252 (0.327)***          |                  |                                  |
| cluster03         |                      |                  | 3.945 (0.654)*** | 4.338 (0.327)***          |                  |                                  |
| x:cluster02       |                      |                  |                  | -0.957 (0.311)**          |                  |                                  |
| x:cluster03       |                      |                  |                  | 3.590 (0.319)***          |                  |                                  |
| x:cluster04       |                      |                  |                  | 2.395 (0.351)***          |                  |                                  |
| x:cluster05       |                      |                  |                  | -0.702 (0.378)+           |                  |                                  |
| x:cluster06       |                      |                  |                  | 0.890 (0.303)**           |                  |                                  |
| SD (Intercept)    | σ_u = 1.5            |                  |                  |                           | 1.672            | 1.744                            |
| SD (x)            | σ_v = 1.8            |                  |                  |                           |                  | 1.988                            |
| Cor (Intercept~x) | ρ = 0.5              |                  |                  |                           |                  | 0.616                            |
| SD (Observations) | σ = 1                |                  |                  |                           | 2.068            | 1.028                            |
| Num.Obs.          |                      | 600              | 600              | 600                       | 600              | 600                              |
| R2                |                      | 0.151            | 0.506            | 0.884                     |                  |                                  |
| R2 Adj.           |                      | 0.150            | 0.480            | 0.871                     |                  |                                  |
| R2 Cond.          |                      |                  |                  |                           | 0.487            | 0.886                            |
| AIC               |                      | 2873.1           | 2606.6           | 1794.9                    | 2663.2           | 1978.3                           |
| BIC               |                      | 2886.3           | 2747.3           | 2063.1                    | 2680.8           | 2004.6                           |
| ICC               |                      |                  |                  |                           | 0.4              | 0.9                              |
| RMSE              |                      | 2.64             | 2.01             | 0.98                      | 2.02             | 0.98                             |

*Only a small sample of cluster-level fixed effects is shown




  1. Use your models to create a caterpillar plot for the intercepts and lattice and/or spaghetti plots to show the features of your multilevel models. Describe and discuss the plots: what new light do they shed? What are the advantages or disadvantages of presenting the models using these plots?

To create the caterpillar plot, you can use the following code. Don’t forget to fill in the gaps!

# First, we need the list of clusters
clusters <- df |> 
  pull(cluster) |> 
  unique()

# Now let's fit our multilevel model
mdl_multilevel <- lmer(y ~ x + (x | cluster), df) 

# predictions of y for each cluster at x = 0
predict_y <- mdl_multilevel |> 
  ranef() |> 
  augment() |> 
  as_tibble() |> 
  filter(variable == "(Intercept)") |> 
  # Now we order the results by ascending estimates
  arrange(estimate) |> 
  mutate(cluster = fct_inorder(level)) 

# Finally, let's plot our caterpillar!
predict_y |> 
  ggplot(aes(xmin = ___, xmax = ___, y = ___, x = ___)) + 
    geom_pointrange( ) + 
    labs(x = "Estimate of the intercept (Y_hat when X=0)", y = NULL)
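For the spaghetti plot, one possible sketch (assuming the mdl_multilevel object fitted above) draws each cluster's fitted line over the raw observations:

```r
# Spaghetti plot: one fitted line per cluster, raw points overlaid
library(broom.mixed)

mdl_multilevel |> 
  augment() |>                     # adds a .fitted column per observation
  ggplot(aes(x = x, group = cluster)) + 
    geom_point(aes(y = y), alpha = 0.3) + 
    geom_line(aes(y = .fitted), alpha = 0.6) + 
    labs(y = "y", title = "Cluster-specific fitted lines")
```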

Radon in the home: a case study

Our aim in examining this data is to estimate how radon levels are distributed across the approximately 3000 counties in the United States. This information can empower homeowners to decide whether to test and address radon in their homes, using the best local knowledge available.

To conduct this analysis, we view the data hierarchically: houses (level one) within counties (level 2). If we were to delve deeper into multiple measurements within houses, we could create a three-tier hierarchy: measurements (level 1), houses (level 2), and counties (level 3). During the analysis, we consider an important factor: where the measurement was taken, whether in the basement or on the first floor. Radon, originating underground, can more readily enter homes built into the ground.

Additionally, at the county level, we use a measurement of soil uranium available for each county. This hierarchical model enables us to create a regression model for individual measurements while accommodating consistent, unexplained variations among the 3000 counties.

This section closely follows chapter 12 in Gelman and Hill’s influential book Data analysis using regression and multilevel/hierarchical models. You will find the chapter available as an optional reading (link also available through Brightspace). The chapter is available on Perusall to allow you to share code, discuss, or ask questions. I will be looking at the discussion from time to time, so if you need help, please do not hesitate to @tag me 🤙

In that chapter you will find the background for the study, and an explanation of the variables. The original dataset and the code used to produce the book is available here.

The full dataset consists of a large number of radon measurements, so in this document we will include only the measurements from the state of Minnesota. Feel free to choose a different state if you like.

radon_mn <- read_rds("https://bit.ly/BMLR-Radon") |> 
  filter(state == "MN") |>
  # drop counties without measurements
  mutate(county = fct_drop(county))

  1. Create a table1 and use GGally::ggpairs to explore your data.

You will now visualize the estimates and confidence intervals for the average log radon across the different counties, plotted against the (jittered) number of measurements taken in each county. We will use these estimates to compare two models: the no-pooling model and the multilevel model with partial pooling, both without predictors. The horizontal line in each plot represents an estimate of the average radon level across all counties. This is a replication of figure 12.1 in the chapter mentioned above.

  1. Replicate the figures below and explain why the counties with fewer measurements have more variable estimates and larger standard errors. What are the main advantage and disadvantage of the no-pooling analysis, as illustrated in the plot on the left, compared with the multilevel model? Explain.

First, we need to associate each county with the number of observations.

radon_mn_count <- radon_mn |>
  count(county, name = "n.obs")

m.no_pool <- radon_mn |>
  lm(log.radon ~ 0 + county, data = _)

m.partial_pool <- radon_mn |>
  lmer(log.radon ~ (1|county), data = _)

m.no_pool |> 
  predictions(newdata = datagrid(county = unique)) |> 
  as_tibble() |> inner_join(radon_mn_count, by = "county") |> 
  select(county, estimate, conf.low, conf.high, n.obs) |> 
  ggplot(
    aes(
      x    = ___, 
      y    = ___, 
      ymin = ___, 
      ymax = ___
      )
    ) + 
  geom_pointrange(
    position = position_jitter(width = .5, height = 0)
    ) 
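The same template can be reused for the multilevel model; a sketch (with the same gaps to fill in) might look like this:

```r
# Partial pooling: county-level predictions from the multilevel model
m.partial_pool |> 
  predictions(newdata = datagrid(county = unique)) |> 
  as_tibble() |> 
  inner_join(radon_mn_count, by = "county") |> 
  select(county, estimate, conf.low, conf.high, n.obs) |> 
  ggplot(aes(x = ___, y = ___, ymin = ___, ymax = ___)) + 
    geom_pointrange(
      position = position_jitter(width = .5, height = 0)
      )
```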

  1. In this exercise, you will replicate a set of models as shown below. This illustration is based on the ideas shown in figure 12.4 in the book. Please replicate this illustration with a twist: for example, you may choose a different set of counties than the one shown below or the one shown in the book, or you may try something different. Please explain the graphs. What do the dashed red and black lines show? What is the blue line? What determines whether the blue line is more similar to the red line or to the black line? How does this figure compare to the one shown in chapter 12 of the book?

Hints The three lines show the predictions of the three models: the complete pooling, the no pooling and the multilevel (partial pooling) models. Further hints for creating the plots will be available in the following few days. Thank you for your patience.

  1. Finally, we are going to add a fixed effect at the county level (level 2). This illustration is based on figure 12.6 in the book, and it shows the estimated county level coefficients, plotted vs. county level uranium levels, along with estimated multilevel regression line at level 2. After replicating this graph, please explain it briefly.

Hints for creating the plots will be available in the following few days. Thank you for your patience.

  1. Using the simulation techniques in questions 1 and 2, generate a synthetic dataset that would be similar to the radon data, in the sense that if you fitted the three models to your simulated dataset, you would get similar parameter estimations to the ones you got when fitting your models to the real Radon dataset. Explore your synthetic data using similar visualisations, and comment on similarities or differences between the Radon dataset and your simulated dataset.
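One possible starting point (a sketch; it assumes the m.partial_pool model fitted above) is to read the estimated parameters off the fitted multilevel model and feed them back into the simulation pipeline from question 1:

```r
# Pull the estimated parameters out of the fitted radon model
alpha_hat <- fixef(m.partial_pool)[["(Intercept)"]]          # fixed intercept
sig_u_hat <- attr(VarCorr(m.partial_pool)$county, "stddev")  # random-intercept SD
sig_hat   <- sigma(m.partial_pool)                           # residual SD

# These can replace alpha_0, sig_u and sig in the add_random()/add_ranef()
# pipeline above. Note that counties in the radon data have unequal sizes,
# so you may also want to vary the number of observations per cluster.
```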

Appendix

The caterpillar plots for the intercepts are shown below, calculated as the predicted value of the intercept and its confidence interval for each cluster. The confidence intervals for the no-pooling model are the largest, and the estimates preserve the rank order better when comparing the no-pooling fixed-effects model with the multilevel model with a random intercept.